
Assessment: Gemini Batch Fix & Dataset Preview Row Limiting#820

Open
vprashrex wants to merge 5 commits into main from chore/assessment-gemini-batch-fix

Conversation

@vprashrex
Collaborator

@vprashrex vprashrex commented May 9, 2026

Target issue: #830

Summary

This pull request addresses two key issues within the assessment module:

  1. Fixed bugs in Gemini batch processing to improve the reliability and stability of AI assessment execution and testing workflows.
  2. Added limit_rows support to the dataset preview endpoint, allowing clients to fetch only a limited number of dataset rows instead of the full dataset.
    Previously, large dataset responses caused frontend browser lag and UI freezes due to excessive data rendering on the client side. By introducing row limiting, the frontend can now request lightweight previews (for example, 5 rows), resulting in improved performance and a smoother user experience.
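
As a rough illustration of the intended behavior (the function and parameter names below are hypothetical, not taken from the PR's code), the row-limiting logic amounts to:

```python
def preview_rows(all_rows, limit_rows=None):
    """Sketch of the limit_rows behavior: when the client passes limit_rows
    (validated to the 1-100 range), only the first N rows are returned;
    when it is omitted, no preview is built at all."""
    if limit_rows is None:
        return None  # no preview requested; the dataset file is not fetched
    if not 1 <= limit_rows <= 100:
        raise ValueError("limit_rows must be between 1 and 100")
    return all_rows[:limit_rows]
```

In the real endpoint the range check would live in the FastAPI query parameter declaration; the sketch only shows the trimming semantics.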

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add any other information required for the reviewer here.

Summary by CodeRabbit

  • New Features

    • GET dataset endpoint can return a lightweight preview (column headers + first N rows) via optional limit_rows (1–100).
  • Documentation

    • API docs updated to describe the new limit_rows preview option and its behavior.
  • Refactor

    • Batch submission format now emits row identifiers at the top level for clearer row tracking.
  • Bug Fixes

    • Preview requests return appropriate HTTP error responses for missing/invalid or unsupported files.
  • Tests

    • Test coverage added/updated for dataset preview behavior and batch identifier location.

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 9, 2026

📝 Walkthrough

Walkthrough

Adds optional dataset preview (headers + first N rows) to GET /datasets/{id} with models, service parsing CSV/XLSX, docs and tests; moves Gemini/Google JSONL row identifier to a top-level key (tests updated).

Changes

Assessment dataset preview

Layer / File(s): Summary

  • Preview Pydantic models (backend/app/models/assessment.py): adds AssessmentDatasetPreview and an optional preview field to AssessmentDatasetResponse.
  • Preview parsing service (backend/app/services/assessment/dataset.py): adds _stringify, _preview_csv, _preview_excel, and preview_dataset to fetch and parse CSV/XLSX previews and return headers + rows, with HTTP error handling.
  • API handler and wiring for preview (backend/app/api/routes/assessment/datasets.py): imports preview types/service, extends _dataset_to_response to accept preview, adds the limit_rows query param, builds AssessmentDatasetPreview when requested, and includes it in responses.
  • Endpoint docs (backend/app/api/docs/assessment/get_dataset.md): documents the limit_rows (1–100) parameter and notes that omitting it avoids fetching the underlying file.
  • Preview tests (backend/app/tests/assessment/test_dataset.py, backend/app/tests/assessment/test_routes.py): adds tests for CSV/XLSX preview outputs, encoding fallbacks, error cases, and route-level preview behavior.

Gemini Batch JSONL Schema

Layer / File(s): Summary

  • JSONL row identifier schema (backend/app/crud/assessment/batch.py, backend/app/tests/assessment/test_batch.py): build_google_jsonl now emits the row identifier as a top-level key instead of metadata.key; the test assertion is updated accordingly.
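
For illustration, the schema change can be sketched as follows; the record layout here is a hypothetical simplification of the Gemini batch JSONL format, not the exact production schema:

```python
import json

def build_google_jsonl(rows):
    """Sketch: emit each row's identifier as a top-level "key" (previously
    it was nested under a metadata object), so downstream code can track
    rows without digging into the request payload."""
    lines = []
    for row_id, prompt in rows:
        record = {
            "key": str(row_id),  # top-level row identifier
            "request": {
                "contents": [{"role": "user", "parts": [{"text": prompt}]}],
            },
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

A test can then assert on the identifier's location directly, mirroring the updated assertion in test_batch.py.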

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • kartpop
  • AkhileshNegi
  • Ayush8923

Poem

🐰 A key pops up where it’s easy to see,
Preview hops in with a header and three,
CSV and sheets, trimmed neat and sweet,
Rows and columns met on a tiny treat,
Hooray — the backend’s lighter on its feet!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 19.23%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Title check (✅ Passed): the title accurately summarizes the two main changes: a Gemini batch schema fix and dataset preview row limiting functionality.
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.



@vprashrex vprashrex requested a review from Prajna1999 May 11, 2026 05:34
@vprashrex vprashrex self-assigned this May 11, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/api/routes/assessment/datasets.py`:
- Around line 149-161: The truncated flag is over-reported because the code
treats len(rows) >= limit_rows as truncated; to fix, request one extra row from
preview_assessment_dataset (call with limit=limit_rows + 1), set truncated =
len(rows) > limit_rows, and if truncated trim rows to the original limit_rows
before constructing AssessmentDatasetPreview (use the existing names session,
dataset, limit_rows, preview_assessment_dataset, headers, rows, and
AssessmentDatasetPreview).
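
Concretely, the first fix above amounts to something like the following sketch, where the fetch_rows callable stands in for preview_assessment_dataset and all names are illustrative:

```python
def fetch_preview(fetch_rows, limit_rows):
    """Request one extra row so `truncated` is reported only when more data
    actually exists beyond the requested limit, then trim back down."""
    headers, rows = fetch_rows(limit_rows + 1)
    truncated = len(rows) > limit_rows
    if truncated:
        rows = rows[:limit_rows]
    return headers, rows, truncated
```

With the original len(rows) >= limit_rows test, a dataset of exactly limit_rows rows would be falsely flagged as truncated; the extra-row probe removes that false positive.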

In `@backend/app/services/assessment/dataset.py`:
- Around line 197-219: The current preview logic defaults to CSV for any
non-".xlsx" file_ext which can silently mis-handle missing/invalid metadata;
update the preview path in the preview function (where file_ext is derived) to
validate file_ext explicitly (normalize with .lower() and strip), and only allow
known extensions like ".xlsx" and ".csv"; if file_ext is None or not in the
allowed set, raise HTTPException(status_code=422, detail="Unsupported or missing
file extension.") instead of calling _preview_csv, otherwise call _preview_excel
for ".xlsx" and _preview_csv for ".csv".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 60a75072-f1b1-49b2-b789-0b3989427bea

📥 Commits

Reviewing files that changed from the base of the PR and between e08abbc and 15ad20d.

📒 Files selected for processing (5)
  • backend/app/api/docs/assessment/get_dataset.md
  • backend/app/api/routes/assessment/datasets.py
  • backend/app/models/assessment.py
  • backend/app/services/assessment/dataset.py
  • backend/app/tests/assessment/test_batch.py
✅ Files skipped from review due to trivial changes (1)
  • backend/app/api/docs/assessment/get_dataset.md

Comment thread backend/app/api/routes/assessment/datasets.py
Comment on lines +197 to +219
file_ext = (dataset.dataset_metadata or {}).get("file_extension")
if file_ext == ".xls":
    raise HTTPException(
        status_code=422,
        detail="Legacy Excel format (.xls) is not supported.",
    )

storage = get_cloud_storage(session=session, project_id=project_id)
try:
    content = storage.get(dataset.object_store_url)
except Exception as e:
    logger.warning(
        f"[preview_dataset] Failed to fetch file | dataset_id={dataset.id} | {e}",
        exc_info=True,
    )
    raise HTTPException(
        status_code=502, detail="Failed to fetch dataset file from storage."
    ) from e

try:
    if file_ext == ".xlsx":
        return _preview_excel(content, limit)
    return _preview_csv(content, limit)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject unknown/missing file extensions instead of defaulting to CSV.

Line 219 defaults to CSV parsing for non-.xlsx values. If metadata is missing/incorrect, preview can return garbage instead of a clear 422.

Suggested fix
-    file_ext = (dataset.dataset_metadata or {}).get("file_extension")
+    file_ext = ((dataset.dataset_metadata or {}).get("file_extension") or "").lower()
     if file_ext == ".xls":
         raise HTTPException(
             status_code=422,
             detail="Legacy Excel format (.xls) is not supported.",
         )
+    if file_ext not in {".csv", ".xlsx"}:
+        raise HTTPException(
+            status_code=422,
+            detail="Unsupported or missing dataset file extension for preview.",
+        )
...
-        if file_ext == ".xlsx":
+        if file_ext == ".xlsx":
             return _preview_excel(content, limit)
         return _preview_csv(content, limit)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/assessment/dataset.py` around lines 197 - 219, The
current preview logic defaults to CSV for any non-".xlsx" file_ext which can
silently mis-handle missing/invalid metadata; update the preview path in the
preview function (where file_ext is derived) to validate file_ext explicitly
(normalize with .lower() and strip), and only allow known extensions like
".xlsx" and ".csv"; if file_ext is None or not in the allowed set, raise
HTTPException(status_code=422, detail="Unsupported or missing file extension.")
instead of calling _preview_csv, otherwise call _preview_excel for ".xlsx" and
_preview_csv for ".csv".

@codecov

codecov Bot commented May 12, 2026

Codecov Report

❌ Patch coverage is 97.77778% with 4 lines in your changes missing coverage. Please review.

Files with missing lines
  • backend/app/services/assessment/dataset.py: patch coverage 93.84%, 4 lines missing ⚠️

📢 Thoughts on this report? Let us know!

@vprashrex vprashrex changed the title from "Assessment (HotFix): Gemini Batch Fix" to "Assessment: Gemini Batch Fix" May 12, 2026

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
backend/app/tests/assessment/test_dataset.py (3)

148-163: 💤 Low value

Consider importing openpyxl at module level.

openpyxl is already imported at line 7 for InvalidFileException. Importing it again inside the test function (lines 149, 151) is inconsistent with the module-level import pattern.

♻️ Proposed consolidation

At the top of the file, consolidate the imports:

 from openpyxl.utils.exceptions import InvalidFileException
+import openpyxl
+import io

Then remove the inline imports in the test functions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/assessment/test_dataset.py` around lines 148 - 163, The
test function test_preview_excel_returns_headers_and_rows imports openpyxl
locally even though openpyxl is already imported at module level for
InvalidFileException; remove the inline imports inside
test_preview_excel_returns_headers_and_rows and any other tests, and add/ensure
a single module-level import for openpyxl alongside InvalidFileException so the
test uses that top-level import instead.

142-146: ⚡ Quick win

Strengthen the latin-1 fallback assertion.

The test claims to verify latin-1 fallback but only checks that the value starts with "ca". It should verify that the invalid UTF-8 byte \xff was correctly decoded as ÿ (U+00FF in latin-1) rather than dropped.

✨ Proposed stronger assertion
     def test_preview_csv_handles_latin1_fallback(self) -> None:
         # \xff is invalid utf-8 -> falls back to latin-1
         headers, rows = _preview_csv(b"name\nca\xfffe\n", limit=5)
         assert headers == ["name"]
-        assert rows and rows[0][0].startswith("ca")
+        assert rows and rows[0][0] == "caÿfe"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/assessment/test_dataset.py` around lines 142 - 146, Update
the test_preview_csv_handles_latin1_fallback assertion to verify the latin-1
decoded character is present: call _preview_csv as before, then assert that the
first cell exactly equals "caÿfe" or contains the Unicode character U+00FF (ÿ) to
ensure the invalid UTF-8 byte 0xFF was decoded via latin-1; refer to the test
function name test_preview_csv_handles_latin1_fallback and the helper
_preview_csv when locating the change.

165-175: ⚡ Quick win

Clarify expected behavior for empty workbooks.

The assertion at line 174 accepts two different outcomes ([""] or []), which suggests either:

  1. The expected behavior for empty workbooks is not well-defined, or
  2. The test is being overly permissive.

Consider determining the correct expected behavior and asserting only that outcome.

♻️ Proposed fix

If empty workbooks should return an empty list:

-        assert headers == [""] or headers == []
+        assert headers == []

Or if they should return a list with one empty string:

-        assert headers == [""] or headers == []
+        assert headers == [""]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/assessment/test_dataset.py` around lines 165 - 175, The
test test_preview_excel_empty_workbook is ambiguous because it accepts two
outcomes for headers; decide the canonical behavior for _preview_excel (either
return [] for no headers or [""] to represent a single empty header) and update
the test to assert that single expected value only; locate the test function
test_preview_excel_empty_workbook and the helper _preview_excel, then change the
assertion to assert headers == <chosen_expected_value> (and keep assert rows ==
[]), or if you choose to change _preview_excel instead, make it return the
chosen headers shape for an empty workbook and keep the test asserting that one
outcome.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@backend/app/tests/assessment/test_dataset.py`:
- Around line 148-163: The test function
test_preview_excel_returns_headers_and_rows imports openpyxl locally even though
openpyxl is already imported at module level for InvalidFileException; remove
the inline imports inside test_preview_excel_returns_headers_and_rows and any
other tests, and add/ensure a single module-level import for openpyxl alongside
InvalidFileException so the test uses that top-level import instead.
- Around line 142-146: Update the test_preview_csv_handles_latin1_fallback
assertion to verify the latin-1 decoded character is present: call _preview_csv
as before, then assert that the first cell exactly equals "caÿfe" or contains the
Unicode character U+00FF (ÿ) to ensure the invalid UTF-8 byte 0xFF was decoded
via latin-1; refer to the test function name
test_preview_csv_handles_latin1_fallback and the helper _preview_csv when
locating the change.
- Around line 165-175: The test test_preview_excel_empty_workbook is ambiguous
because it accepts two outcomes for headers; decide the canonical behavior for
_preview_excel (either return [] for no headers or [""] to represent a single
empty header) and update the test to assert that single expected value only;
locate the test function test_preview_excel_empty_workbook and the helper
_preview_excel, then change the assertion to assert headers ==
<chosen_expected_value> (and keep assert rows == []), or if you choose to change
_preview_excel instead, make it return the chosen headers shape for an empty
workbook and keep the test asserting that one outcome.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ab2e4d5-c58f-4731-ada1-cfe46d7ba28d

📥 Commits

Reviewing files that changed from the base of the PR and between fa5e476 and 0a20a8b.

📒 Files selected for processing (2)
  • backend/app/tests/assessment/test_dataset.py
  • backend/app/tests/assessment/test_routes.py

@vprashrex vprashrex requested a review from Ayush8923 May 12, 2026 13:59
@vprashrex vprashrex changed the title from "Assessment: Gemini Batch Fix" to "Assessment: Gemini Batch Fix & Dataset Preview Row Limiting" May 12, 2026
